ISSS608 Visual Analytics & Applications Coursework
by Wilson Tan
  • Hands-on Exercises
    • Week 1: Hands-on Exercise
    • Week 2: Hands-on Exercise
    • Week 3: Hands-on Exercise
    • Week 4: Hands-on Exercise
  • In-class Exercises
    • Week 1: In-class Exercise
    • Week 2: In-class Exercise
    • Week 3: In-class Exercise
    • Week 4: In-class Exercise
    • Week 5: In-class Exercise
    • Week 6: In-class Exercise
    • Week 7: In-class Exercise
  • Take-home Exercises
    • Take-home Exercise 01
    • Take-home Exercise 02
    • Take-home Exercise 03

Take-home Exercise 02

Detecting illegal, unreported, and unregulated (IUU) fishing

FishEye International has provided import/export data ranging from the year 2028 to 2034 regarding Oceanus’ fishing industries. This exercise attempts to make use of the data provided to detect those who are involved in IUU fishing, in particular, to answer the following question:

  • Identify companies that fit a pattern of illegal fishing. Use visualizations to support your conclusions and your confidence in them.

Import data

The following code chunk is used to import the data. Since the provided data is in .json format, the fromJSON() function is used:

json_file_path <- "data/mc2_challenge_graph.json"
mc2_file_path <- "data/mc2.rds"

if (!file.exists(mc2_file_path)) {
  mc2 <- fromJSON(json_file_path)
  saveRDS(mc2, mc2_file_path)
} else {
  mc2 <- readRDS(mc2_file_path)
}
Quicken the import process

Importing data from the .json file takes time. Hence, an if-else loop is written here to ensure that the .json file only has to be imported once, after which, it will be saved as a .rds file. If the .rds file already exists, then it can be loaded directly with no need to re-run the .json file.

Data wrangling

Now that the data has been imported, we can load them as tibble dataframes. The select() function is used to select the relevant columns only and at the same time to re-order them into the desired order.

mc2_nodes <- as_tibble(mc2$nodes) %>%
  select(id, shpcountry, rcvcountry)

mc2_edges <- as_tibble(mc2$links) %>%
  mutate(ArrivalDate = ymd(arrivaldate)) %>%
  mutate(Year = year(ArrivalDate)) %>%
  select(
    source,
    target,
    ArrivalDate,
    Year,
    hscode,
    valueofgoods_omu,
    volumeteu,
    weightkg,
    valueofgoodsusd
  ) %>%
  distinct()

For every edge, it specifies a trade between a source and a target. The following code helps to yield a dataframe mc2_nodes_extracted of unique entities that appear on either ends of every edge (either a source or target of every trading relationship). At the same time, since some of the names of entities are rather long, a cid is generated as an auto-incremented company ID, so that the companies can be referred to more easily.

mc2_nodes_extracted <- union(unique(mc2_edges$source),
                             unique(mc2_edges$target)) %>% sort() %>% as_tibble()
colnames(mc2_nodes_extracted) <- "name"
mc2_nodes_extracted <- mc2_nodes_extracted %>%
  mutate(cid = row_number())

Since every edge represents one transaction, having some sort of grouping would help to reduce the number of edges in the network.

A grp_hscode variable is generated as the first digit of the column hscode. HS codes are 6 digit numbers specifying the exact type of good that is being traded. As there are many types of goods, plotting them all based on hscode would be too messy. By extracting only the first digit of HS codes, some insight may still be gleaned from these broader categories (there will only be 9 categories, 1 for each digit).

Thereafter, the edges are grouped by source and target companies, grp_hscode and Year. The number of trades (num_trades) and total weight in kg (total_weightkg) are summarised for each group. Only trading relationships between two companies with a frequency of more than 20 per year are included in the network. This helps to filter out the low-frequency traders that are less likely to have substantial impact on the industry.

mc2_edges$grp_hscode <- substr(mc2_edges$hscode, 1, 1)

mc2_edges_agg <- mc2_edges %>%
  group_by(source, target, grp_hscode, Year) %>%
  summarise(num_trades = n(),
            total_weightkg = sum(weightkg)) %>%
  filter(source != target) %>%
  filter(num_trades > 20) %>%
  ungroup()
Missing description on hscode

Though the data dictionary specifies that more information can be gleaned by merging with the hscode table, there is no such table that can be found from the downloads. As such, we do not have the description of the types of goods being traded, and will have to assume for this exercise that all goods provided in this dataset are fish/marine life-related.

Plotting the network graph

Now, a tbl_graph object will be created for the purpose of plotting the network graph. At the same time, the betweenness centrality and out-degree centrality of each company (across all years of trade) will be generated for each company (represented as a node). Betweenness and out-degree are both determined based on number of trades.

The reason for these measures are for the detection of potential IUU fishing:

  • Companies with positive betweenness centrality would mostly likely be acting as intermediaries or distributors. Such companies would be important links for the industry network. Hence, they are less likely to be the ones involved in IUU fishing. On the contrary, if a company has zero betweenness centrality, it is unlikely to serve as an important link for the industry network.

  • Companies with positive out-degree centrality, in addition to having zero betweenness centrality, would likely be fishing companies who sell what they catch. However, legitimate fishing companies who are not trying to keep a low profile would in all likelihood also engage in some sort of buying (e.g., for bait, ship components, or for other business activities), and thus not have zero betweenness centrality.

  • Therefore, companies with zero betweenness centrality and positive out-degree centrality can be considered suspicious.

mc2_graph <- tbl_graph(nodes = mc2_nodes_extracted,
                       edges = mc2_edges_agg,
                       directed = TRUE) %>%
  activate(nodes) %>%
  mutate(betweenness_centrality = centrality_betweenness(weights = num_trades)) %>%
  mutate(outdegree_centrality = centrality_degree(weights = num_trades,
                                                  mode = "out"))

The following code chunk creates the ggraph object for the year 2028. The for-loop is designed for scalability (adding years into the years vector will create the respective years’ ggraph objects separately).

The following elements are parts of the design of the network:

  • Betweenness and out-degree centrality are re-generated for each node as the data is now filtered by each year.

  • filter(!node_is_isolated()) function helps to remove nodes that have no corresponding edges for the year.

  • Edge widths represent the number of trades.

  • Edge colours represent the type of goods being traded, based on grp_hscode.

  • Node sizes represent the betweenness centrality.

  • Node fills represent whether the out-degree centrality is zero or non-zero.

  • Interactive elements will be explained later on.

years = c("2028")

for (y in years) {
  mygraph <- paste("mc2", "graph", y, sep = "_")
  assign(
    mygraph,
    mc2_graph %>%
      activate(edges) %>%
      filter(Year == y) %>%
      activate(nodes) %>%
      filter(!node_is_isolated()) %>%
      mutate(betweenness_centrality = centrality_betweenness(weights = num_trades)) %>%
      mutate(outdegree_centrality = centrality_degree(weights = num_trades,
                                                      mode = "out"))
  )

  assign(
    paste("g", y, sep = "_"),
    ggraph(get(mygraph),
           layout = "nicely") +
      geom_edge_link(aes(width = num_trades,
                         color = grp_hscode),
                     alpha = 0.6) +
      scale_edge_width(range = c(0.4, 4), name = "Total weight") +
      scale_edge_color_brewer(name = "HS code group",
                              palette = "Set1") +
      geom_point_interactive(
        aes(
          x = x,
          y = y,
          tooltip = paste0(
            "Name:  ", name,
            "\nCompany ID:  ", cid,
            "\nOut-degree:  ", outdegree_centrality,
            "\nBetweenness:  ", betweenness_centrality
          ),
          data_id = outdegree_centrality > 0,
          size = betweenness_centrality,
          fill = outdegree_centrality > 0
        ),
        colour = "grey20",
        shape = 21,
        alpha = 0.8
      ) +
      scale_fill_manual(labels = c("Zero", "Non-zero"), values = c("cyan", "firebrick1"), name = "Out-degree") +
      scale_size_continuous(range = (c(1, 10)), name = "Betweenness") +
      theme_graph(
        foreground = "grey20",
      ) +
      labs(title = y) +
      theme(plot.title = element_text(size = 11))
  )
}

rm(y, years, mygraph)

Network graph for year 2028

To have a sense of the scale of the network, only data from the year 2028 will be plotted using the following code chunk.

girafe(ggobj = g_2028,
       options = list(opts_hover(css = "fill:;"),
                      opts_hover_inv(css = "opacity: 0.2;"),
                      opts_selection(type = "multiple", only_shiny = FALSE,
                                     css = "opacity:1;"),
                      opts_selection_inv(css = "opacity:0;")))
Interactive elements

A guide on how to interact with the network:

  • Mouseover each node to view a tooltip with information regarding the node.

  • Click on a node of a specific colour to view only the nodes belonging to that colour (which specifies whether out-degree is zero or non-zero).

For this network plot, the nodes that are small red dots are of interest, as they represent companies with zero betweenness centrality and positive out-degree centrality.

Further identifying suspicious companies

Furthermore, FishEye knows from past experience that companies caught fishing illegally will shut down but will then often start up again under a different name. As such, to make the network plot clearer, the companies that close down prematurely before 2034 (the end-year of the given dataset) will be identified.

suspects <-
  mc2_graph %>% activate(nodes) %>% data.frame() %>% tibble() %>%
  filter(betweenness_centrality == 0) %>%
  filter(outdegree_centrality > 0)

companies_closed_down_1 <- mc2_edges_agg %>%
  group_by(source) %>%
  summarise(year_of_closure = max(Year))
companies_closed_down_2 <- mc2_edges_agg %>%
  group_by(target) %>%
  summarise(year_of_closure = max(Year))

companies_closed_down <-
  merge(
    companies_closed_down_1,
    companies_closed_down_2,
    by.x = "source",
    by.y = "target",
    all = TRUE
  )
companies_closed_down$year_of_closure <-
  pmax(
    companies_closed_down$year_of_closure.x,
    companies_closed_down$year_of_closure.y,
    na.rm = TRUE
  )
suspects <- companies_closed_down %>%
  select(name = source, year_of_closure) %>%
  filter(year_of_closure < max(mc2_edges_agg$Year)) %>%
  filter(name %in% suspects$name)

rm(companies_closed_down_1, companies_closed_down_2, companies_closed_down)

There are 6299 companies in the dataset. 1545 companies with zero betweenness centrality and positive out-degree centrality have closed down prematurely. This helps to cut down the number of nodes that are plotted.

These nodes will be labelled as suspicious, which is a variable that will take the value of “Yes” if the company fits the criteria, and “No” otherwise.

mc2_graph <- mc2_graph %>%
  activate(nodes) %>%
  mutate(suspicious = ifelse(name %in% suspects$name, "Yes", "No"))

The new networks will be plotted, using data from the years 2028, 2030, and 2032 (not every year is plotted due to memory limitations).

Show the code
years = c("2028", "2030", "2032")

for (y in years) {
  mygraph <- paste("mc2", "graph", y, sep = "_")
  assign(
    mygraph,
    mc2_graph %>%
      activate(edges) %>%
      filter(Year == y) %>%
      activate(nodes) %>%
      filter(!node_is_isolated()) %>%
      mutate(betweenness_centrality = centrality_betweenness(weights = num_trades)) %>%
      mutate(outdegree_centrality = centrality_degree(weights = num_trades,
                                                      mode = "out"))
  )

  assign(
    paste("g", y, sep = "_"),
    ggraph(get(mygraph),
           layout = "nicely") +
      geom_edge_link(aes(width = num_trades,
                         color = grp_hscode),
                     alpha = 0.6) +
      scale_edge_width(range = c(0.4, 4), name = "Total weight") +
      scale_edge_color_brewer(name = "HS code group",
                              palette = "Set1") +
      geom_point_interactive(
        aes(
          x = x,
          y = y,
          tooltip = paste0(
            "Name:  ", name,
            "\nCompany ID:  ", cid,
            "\nOut-degree:  ", outdegree_centrality,
            "\nBetweenness:  ", betweenness_centrality,
            "\nSuspicious?:  ", suspicious
          ),
          data_id = suspicious,
          size = betweenness_centrality,
          fill = suspicious
        ),
        colour = "grey20",
        shape = 21,
        alpha = 0.8
      ) +
      scale_fill_manual(values = c("cyan", "firebrick1"), name = "Suspicious?") +
      scale_size_continuous(range = (c(1, 10)), name = "Betweenness") +
      theme_graph(
        foreground = "grey20",
      ) +
      labs(title = y) +
      theme(plot.title = element_text(size = 11))
  )
}

rm(y, years, mygraph)

Network plots for 2028, 2030, and 2032

  • 2028
  • 2030
  • 2032
Companies suspected of IUU fishing

These are the red nodes within the above network plots. For starters, the less connected nodes on the perimeter of the network can be explored first.

The list of suspicious companies can be found in the data table below:

mc2_graph %>%
  activate(nodes) %>%
  data.frame() %>%
  filter(suspicious == "Yes") %>%
  select(name, cid, betweenness_centrality, outdegree_centrality) %>%
  datatable(options = list(order = list(list(4, 'desc'))))

Making the network plot even smaller

Lastly, with clearer clarity on the hscode and which are the goods most associated with IUU fishing, the network can be made smaller.

For example, focusing only on goods with grp_hscode == 3:

Show the code
years = c("2028")

for (y in years) {
  mygraph <- paste("mc2", "graph", y, sep = "_")
  assign(
    mygraph,
    mc2_graph %>%
      activate(edges) %>%
      filter(Year == y) %>%
      filter(grp_hscode == "3") %>%
      activate(nodes) %>%
      filter(!node_is_isolated()) %>%
      mutate(betweenness_centrality = centrality_betweenness(weights = num_trades)) %>%
      mutate(outdegree_centrality = centrality_degree(weights = num_trades,
                                                      mode = "out"))
  )

  assign(
    paste("g", y, sep = "_"),
    ggraph(get(mygraph),
           layout = "nicely") +
      geom_edge_link(aes(width = num_trades,
                         color = grp_hscode),
                     alpha = 0.6) +
      scale_edge_width(range = c(0.4, 4), name = "Total weight") +
      scale_edge_color_brewer(name = "HS code group",
                              palette = "Set1") +
      geom_point_interactive(
        aes(
          x = x,
          y = y,
          tooltip = paste0(
            "Name:  ", name,
            "\nCompany ID:  ", cid,
            "\nOut-degree:  ", outdegree_centrality,
            "\nBetweenness:  ", betweenness_centrality,
            "\nSuspicious?:  ", suspicious
          ),
          data_id = suspicious,
          size = betweenness_centrality,
          fill = suspicious
        ),
        colour = "grey20",
        shape = 21,
        alpha = 0.8
      ) +
      scale_fill_manual(values = c("cyan", "firebrick1"), name = "Suspicious?") +
      scale_size_continuous(range = (c(1, 10)), name = "Betweenness") +
      theme_graph(
        foreground = "grey20",
      ) +
      labs(title = "Network plot for year 2028 and HS Group 3") +
      theme(plot.title = element_text(size = 11))
  )
}

rm(y, years, mygraph)
Show the code
girafe(ggobj = g_2028,
       options = list(opts_hover(css = "fill:;"),
                      opts_hover_inv(css = "opacity: 0.2;"),
                      opts_selection(type = "multiple", only_shiny = FALSE,
                                     css = "opacity:1;"),
                      opts_selection_inv(css = "opacity:0;")))